
Add doc-extract: lightweight DOCX/XLSX/PPTX/PDF text extraction #2

Open: splitbrain wants to merge 1 commit into main from claude/document-extraction-library-S6FUQ


@splitbrain
Owner

Summary

  • Adds a small PHP 8.1+ library that extracts plain text from DOCX, XLSX, PPTX and PDF files via a single entry point: ExtractorFactory::extract($path).
  • OOXML formats are parsed with splitbrain/php-archive (^1.4.2) + XMLReader — no ext-zip dependency. PDF uses smalot/pdfparser.
  • File-type detection by extension only (case-insensitive); unknown extensions throw UnsupportedFormatException, parse failures throw ExtractionException.
  • XLSX resolves shared strings + sheet names; PPTX honours sldIdLst order rather than slide-filename order; DOCX includes headers/footers.
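
To illustrate the sldIdLst point, here is a minimal sketch of how slide order can be recovered with XMLReader. It is not taken from this PR: the function names `readRelationships` and `slideOrder` are hypothetical, and it assumes the caller has already read `ppt/presentation.xml` and `ppt/_rels/presentation.xml.rels` out of the archive as strings.

```php
<?php
// Namespace of the r:id attribute on <p:sldId> (part of the OOXML standard).
const REL_NS = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships';

/** Hypothetical helper: map relationship Id => Target from a .rels part. */
function readRelationships(string $relsXml): array
{
    $reader = new XMLReader();
    $reader->XML($relsXml);
    $map = [];
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'Relationship') {
            $id = $reader->getAttribute('Id');
            $target = $reader->getAttribute('Target');
            if ($id !== null && $target !== null) {
                $map[$id] = $target;
            }
        }
    }
    $reader->close();
    return $map;
}

/** Hypothetical helper: slide targets in presentation order, not filename order. */
function slideOrder(string $presentationXml, string $relsXml): array
{
    $targets = readRelationships($relsXml);
    $reader = new XMLReader();
    $reader->XML($presentationXml);
    $ordered = [];
    while ($reader->read()) {
        // Each <p:sldId> inside <p:sldIdLst> points at a slide via its r:id.
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'sldId') {
            $rid = $reader->getAttributeNs('id', REL_NS);
            if ($rid !== null && isset($targets[$rid])) {
                $ordered[] = $targets[$rid];
            }
        }
    }
    $reader->close();
    return $ordered;
}
```

The order of the returned targets follows the `<p:sldId>` elements, so a deck whose second slide lives in `slide1.xml` is still read in the order the author arranged it.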

Test plan

  • composer install resolves on PHP 8.4 without ext-zip
  • vendor/bin/phpunit — 22/22 tests pass, no warnings
  • Round-trip test extracts known strings from runtime-built DOCX, XLSX, PPTX and PDF fixtures
  • Extension routing test covers .docx/.DOCX/.xlsx/.pptx/.pdf, unknown extension, no extension, and legacy .doc
  • Corrupt-zip path raises ExtractionException cleanly (relies on splitbrain/php-archive 1.4.2 EOCD-detection fix)
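
The extension routing cases above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `detectFormat` and `UnsupportedFormatSketchException` are hypothetical stand-ins, since the library's real class names and namespaces beyond those quoted in the description are not shown here.

```php
<?php
// Local stand-in for the UnsupportedFormatException named in the PR;
// the real class and its namespace are assumptions not visible here.
final class UnsupportedFormatSketchException extends RuntimeException
{
}

/**
 * Hypothetical helper: decide the format from the file extension only,
 * case-insensitively, and reject anything that is not supported.
 */
function detectFormat(string $path): string
{
    // pathinfo() returns an empty string when the path has no extension.
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return match ($ext) {
        'docx', 'xlsx', 'pptx', 'pdf' => $ext,
        default => throw new UnsupportedFormatSketchException(
            $ext === '' ? 'No file extension' : "Unsupported extension: .$ext"
        ),
    };
}
```

Under these assumptions, `report.DOCX` routes the same way as `report.docx`, while `legacy.doc` and a file with no extension are both rejected before any parsing starts.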

https://claude.ai/code/session_01RsoVbR75m1VxviTGKKMk3X


Generated by Claude Code

Lightweight PHP library that extracts plain text from the four most
common document formats with minimal dependencies. OOXML files (DOCX,
XLSX, PPTX) are read via splitbrain/php-archive plus XMLReader, dropping
the ext-zip requirement. PDF uses smalot/pdfparser.

The public surface is ExtractorFactory::extract($path), routing on file
extension only. 22 PHPUnit tests build minimal fixtures at runtime and
assert text extraction, sheet/slide ordering, factory dispatch and the
corrupt-input error path.

https://claude.ai/code/session_01RsoVbR75m1VxviTGKKMk3X
